ABSTRACT

We present a distributed implementation of Shor's quantum factoring algorithm on a distributed quantum 
network model. This model provides a means for small capacity quantum computers to work together in such 
a way as to simulate a large capacity quantum computer. In this paper, entanglement is used as a resource
for implementing non-local operations between two or more quantum computers. These non-local operations 
are used to implement a distributed factoring circuit with polynomially many gates. This distributed version 
of Shor's algorithm requires an additional overhead of $O((\log N)^2)$ communication complexity, where $N$ denotes
the integer to be factored.

1. INTRODUCTION

To utilize the full power of quantum computation, one needs a scalable quantum computer with a sufficient 
number of qubits. Unfortunately, the first practical quantum computers are likely to have only small qubit 
capacity. One way to overcome this difficulty is by using the distributed computing paradigm. By a distributed 
quantum computer, we mean a network of limited capacity quantum computers connected via classical and 
quantum channels. Quantum entangled states, in particular generalized GHZ states, provide an effective way of 
implementing non-local operations, such as non-local CNOTs and teleportation. 1, 2

We use distributed quantum computing techniques to construct a distributed quantum circuit for the Shor 
factoring algorithm. Let $n = \log N$, where $N$ is the number to be factored. The gate complexity of this particular
distributed implementation of Shor's algorithm is $O(n^3)$ with $O(n^2)$ communication overhead.*

In section 2, the general principles of distributed quantum computing are outlined, and two primitive
distributed computing operators, cat-entangler and cat-disentangler, are introduced. We use these two primitive
operators to implement non-local operations, such as non-local CNOTs and teleportation. Then we discuss how
to share the cost of implementing a non-local controlled $U$, where $U$ can be decomposed into a number of gates.
The section ends with a distributed implementation of the Fourier transform.

In section 3, we give a detailed description of an implementation of Shor's non-distributed factoring algorithm. 
This implementation is based on the phase estimation and order finding algorithms. We discuss in detail how to 
implement "modular exponentiation"; this implementation serves later in the paper as the blueprint for
creating a distributed quantum algorithm.

In section 4, we implement a distributed factoring algorithm by partitioning the qubits into groups in such a 
way that each group fits on one of the computers making up the network. We then proceed to replace controlled 
gates with non-local controlled gates whenever necessary. 

Further author information: (Send correspondence to Anocha Yimsiriwattana) 
Anocha Yimsiriwattana: E-mail: ayimsil@umbc.edu, URL: http://userpages.umbc.edu/~ayimsil
Samuel J. Lomonaco Jr.: E-mail: lomonaco@umbc.edu, URL: http://www.cs.umbc.edu/~lomonaco

*Shor's factoring algorithm is of gate complexity $O(n^2 \log n \log\log n)$ and space complexity $O(n \log n \log\log n)$.

2. DISTRIBUTED QUANTUM COMPUTING

By a distributed quantum computer (DQC), we mean a network of limited capacity quantum computers
connected via classical and quantum channels. Each computer (or node) possesses a quantum register that can hold
only a fixed limited number of qubits. Each node also possesses a small fixed number of channel qubits which
can be sent back and forth over the network. Each register qubit can freely interact with any other qubit within
the same register, and with the channel qubits that are in the same computer. A register qubit
can interact with qubits on a remote computer by one of two methods: 1) the qubit
can interact via non-local operations, or 2) the qubit can be teleported or physically transported to the remote
computer in order to interact locally with a qubit on that computer.

Indeed, distributed quantum computing can be implemented by only teleporting or physically transporting
qubits back and forth. However, a more efficient implementation of DQC has been proposed by Eisert et al. 1
using non-local CNOT gates. Since the controlled-NOT gate together with all one-qubit gates is a universal set of
gates, 3 a distributed implementation of any unitary transformation reduces to the implementation of non-local
CNOT gates. Eisert et al. also prove that one shared entangled pair and two classical bits are necessary and
sufficient to implement a non-local CNOT gate.

Yimsiriwattana and Lomonaco 2 have identified two primitive operations, cat-entangler and cat-disentangler, 
which can be used to implement non-local operations, e.g., non-local CNOTs, non-local controlled gates, and 
teleportation. Figure 1 illustrates cat-entangler and cat-disentangler operations. 



[Figure 1, panel A: Cat-entangler. Panel B: Cat-disentangler.]

Figure 1. The cat-entangler and cat-disentangler operations for a 5-qubit system are shown in figure 1-A and figure 1-B,
respectively. A dotted line represents a measurement result, which is classical and is used to control X gates. The Z gate
in circuit 1-B is controlled by the exclusive-or ($\oplus$) of the three classical bits resulting from the measurement of qubits
three to five. A qubit is reset to $|0\rangle$ by a controlled-X gate, which is controlled by the classical bit arising from
the measurement of that qubit.



For the implementation of a non-local CNOT gate, an entangled pair must first be established between two
computers. Then, the cat-entangler is used to transform a control qubit $\alpha|0\rangle + \beta|1\rangle$ and an entangled pair
$\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$ into the state $\alpha|00\rangle + \beta|11\rangle$, called a "cat-like" state. This state permits two computers to share
the control qubit. As a result, each computer can now use a qubit shared within the cat-like state as a local
control qubit.

After completion of the control operation, the cat-disentangler is then applied to disentangle and restore the 
control qubit from the cat-like state. Finally, channel qubits are reset by using the classical information resulting 
from measurement to control X gates. In this way, channel qubits can be reused and entangled pairs can be 
re-established. A non-local CNOT circuit is illustrated in figure 2-A. 
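
The non-local CNOT procedure described above can be checked numerically on a small state vector. The following is a minimal sketch, not the authors' implementation: the qubit layout (qubit 0 the control on computer A, qubits 1 and 2 the channel pair, qubit 3 the target on computer B), the helper names, and the use of post-selected measurement outcomes `m1`, `m2` in place of random measurement are all assumptions made for illustration.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
P = [np.diag([1, 0]).astype(complex), np.diag([0, 1]).astype(complex)]  # |0><0|, |1><1|

def lift(gate, qubit, n):
    """Embed a one-qubit gate on `qubit` of an n-qubit register (qubit 0 is leftmost)."""
    out = np.eye(1, dtype=complex)
    for q in range(n):
        out = np.kron(out, gate if q == qubit else I2)
    return out

def cnot(control, target, n):
    return lift(P[0], control, n) + lift(P[1], control, n) @ lift(X, target, n)

def project(state, qubit, outcome, n):
    """Post-select a measurement outcome on `qubit` and renormalise."""
    out = lift(P[outcome], qubit, n) @ state
    return out / np.linalg.norm(out)

def nonlocal_cnot(control, target, m1, m2):
    """Hypothetical layout: q0 control (A), q1 channel (A), q2 channel (B), q3 target (B)."""
    n = 4
    bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    state = np.kron(np.kron(control, bell), target)
    # Cat-entangler: local CNOT on A, measure the channel qubit, send one classical bit.
    state = project(cnot(0, 1, n) @ state, 1, m1, n)
    if m1:
        state = lift(X, 2, n) @ state   # correction on B
        state = lift(X, 1, n) @ state   # reset A's channel qubit to |0>
    # The shared copy of the control acts as a local control on B.
    state = cnot(2, 3, n) @ state
    # Cat-disentangler: H and measure on B, send one classical bit back to A.
    state = project(lift(H, 2, n) @ state, 2, m2, n)
    if m2:
        state = lift(Z, 0, n) @ state   # phase correction on A
        state = lift(X, 2, n) @ state   # reset B's channel qubit to |0>
    return state

def direct_cnot(control, target):
    """Ordinary CNOT on (control, target), padded with channel qubits in |00>."""
    direct = (cnot(0, 1, 2) @ np.kron(control, target)).reshape(2, 2)
    full = np.zeros((2, 2, 2, 2), dtype=complex)
    full[:, 0, 0, :] = direct
    return full.reshape(-1)
```

For every pair of measurement outcomes, the final state should agree with an ordinary CNOT and leave both channel qubits in $|0\rangle$, which is consistent with the cost of one entangled pair and two classical bits quoted above.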

To teleport an unknown qubit from computer A to B, we begin by establishing an entangled pair between
the two computers. Then, we apply the cat-entangler operation to create a cat-like state from the unknown qubit and
the entangled pair. After that, we apply a cat-disentangler operation to disentangle and restore the unknown
qubit from the cat-like state into computer B. Finally, we reset the channel qubits by swapping the unknown
qubit with $|0\rangle$. The teleportation circuit is shown in figure 2-B.

[Figure 2, panel A: Non-local CNOT gate. Panel B: Teleportation circuit.]



Figure 2. This figure shows both the non-local CNOT (A) and the teleportation circuits (B). In both circuits, the
cat-entangler creates a cat-like state, which is shared between the first and the third qubits. In the non-local CNOT circuit
(A), the third line shares with the first line the same control qubit via the cat-like state. It is used as a local control
qubit to control the target qubit. Finally, the cat-disentangler is applied to disentangle the control qubit from the cat-like
state and return the control qubit back to the first line. In the teleportation circuit, the cat-disentangler disentangles the
unknown qubit from the cat-like state, and transfers the unknown state to the third qubit.



Because a cat-like state permits two computers to share a control qubit, the cost of implementing a non-local 
controlled U, where U is a unitary transformation composed of a number of basic gates, can be shared among 
these basic gates. 

For example, let us assume that a unitary transformation has the form $U = U_1 \cdot U_2 \cdot U_3$, where $U_3 = \mathrm{CNOT}$.
Since the control qubit is reused, each non-locally controlled $U_i$ gate can be implemented using asymptotically
only 1/3 entangled pair and 2/3 classical bit, as demonstrated in figure 3.

Before the execution of a non-local operation, an entangled pair must first be established between the channel
qubits of the two computers. If each machine possesses two channel qubits, then two entangled pairs can be established by sending
two qubits. To do so, each computer begins by entangling its own channel qubits, and then exchanges one qubit of
the pair with the other computer. As a result, one entangled pair is established at the asymptotic cost of
sending one qubit. To refresh the entanglement, the procedure is simply repeated after the channel qubits are
reset to the state $|0\rangle$.




Figure 3. Assume $U = U_1 \cdot U_2 \cdot \mathrm{CNOT}$. Then a controlled $U$ can be distributed as shown. The control line needs to be
distributed only once, because it can be reused. This implementation allows the cost of distributing the control qubit to
be shared among the elementary gates.




Figure 4. This figure shows an implementation of the distributed quantum Fourier transform for 4 qubits, implemented on
two machines, using non-local $R_k$ gates. The swap gate can be implemented by teleporting qubits back and forth between
the two computers.



2.1. Distributed quantum Fourier transform 

The quantum Fourier transform is a unitary transformation defined on standard basis states as follows,

$$QFT : |j\rangle \mapsto \frac{1}{\sqrt{2^n}} \sum_{k=0}^{2^n-1} e^{2\pi i jk/2^n} |k\rangle, \qquad (1)$$

where $n$ is the number of qubits.

An efficient circuit for the quantum Fourier transformation can be found in Nielsen and Chuang's book 9 and
also in the paper by Cleve et al. 5 We implement a distributed version of the Fourier transformation by replacing
a controlled $R_k$ with a non-local controlled $R_k$, when necessary. The distributed swap gate can be implemented
by teleporting qubits back and forth between two computers. An implementation of the distributed Fourier
transformation of 4 qubits is shown in figure 4, where the gate $R_k$ is defined as:

$$R_k = \begin{pmatrix} 1 & 0 \\ 0 & e^{2\pi i/2^k} \end{pmatrix} \qquad (2)$$

for $k \in \{2, 3, \ldots\}$.
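
As an illustration of how the Fourier transform decomposes into Hadamard, controlled-$R_k$, and swap gates, the sketch below builds the textbook circuit as a matrix and counts which controlled-$R_k$ gates would cross a machine boundary. It assumes the standard circuit layout (qubit 0 is the most significant bit); the function names and the `machine_of` partition are hypothetical and are not taken from the paper.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
P0 = np.diag([1, 0]).astype(complex)
P1 = np.diag([0, 1]).astype(complex)

def R(k):
    """Phase gate R_k = diag(1, exp(2*pi*i / 2^k))."""
    return np.diag([1, np.exp(2j * np.pi / 2**k)])

def lift(gate, qubit, n):
    out = np.eye(1, dtype=complex)
    for q in range(n):
        out = np.kron(out, gate if q == qubit else I2)
    return out

def controlled(gate, control, target, n):
    return lift(P0, control, n) + lift(P1, control, n) @ lift(gate, target, n)

def swap(a, b, n):
    c_ab, c_ba = controlled(X, a, b, n), controlled(X, b, a, n)
    return c_ab @ c_ba @ c_ab

def qft_circuit(n):
    """Textbook QFT: H on each qubit, controlled-R_k from later qubits, then reverse order."""
    U = np.eye(2**n, dtype=complex)
    for t in range(n):
        U = lift(H, t, n) @ U
        for c in range(t + 1, n):
            U = controlled(R(c - t + 1), c, t, n) @ U
    for q in range(n // 2):
        U = swap(q, n - 1 - q, n) @ U
    return U

def nonlocal_rk_count(n, machine_of):
    """Count controlled-R_k gates whose control and target sit on different machines."""
    return sum(1 for t in range(n) for c in range(t + 1, n)
               if machine_of[t] != machine_of[c])
```

With four qubits split evenly over two machines, `nonlocal_rk_count(4, [0, 0, 1, 1])` counts the controlled-$R_k$ gates that would have to be replaced by non-local ones under this particular partition.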

For a more detailed discussion on distributed quantum computing, please consult Yimsiriwattana and Lomonaco. 2 



3. THE QUANTUM FACTORING ALGORITHM 

The prime factorization problem is defined as follows: Given a composite odd positive number N, find its 
prime factors. 6 

It is well known that factoring a composite number $N$ reduces to the task of choosing a random integer $a$
relatively prime to $N$, and then determining its multiplicative order $r$ modulo $N$, i.e., to find the smallest positive
integer $r$ such that $a^r \equiv 1 \pmod N$. This problem is known as the "order finding problem."

Cleve et al. 5 have shown that the order finding problem reduces to the phase estimation problem, a problem
which can be solved efficiently by a quantum computer. We briefly review these problems in this section.
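
The reduction from factoring to order finding can be sketched classically. In the minimal example below, the order is found by brute force (the step a quantum computer would replace), and a factor is then read off from $\gcd(a^{r/2} - 1, N)$ when $r$ is even and $a^{r/2} \not\equiv -1 \pmod N$. The function names are hypothetical, and the sketch is meant only for small $N$.

```python
from math import gcd

def multiplicative_order(a, N):
    """Smallest positive r with a^r = 1 (mod N); brute force, for illustration only."""
    if gcd(a, N) != 1:
        raise ValueError("a must be relatively prime to N")
    r, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        r += 1
    return r

def factor_from_order(a, N):
    """Return a non-trivial factor pair of N, or None if this choice of a fails."""
    r = multiplicative_order(a, N)
    if r % 2 == 1:
        return None                 # odd order: pick another a
    y = pow(a, r // 2, N)
    if y == N - 1:
        return None                 # a^(r/2) = -1 (mod N): pick another a
    p = gcd(y - 1, N)
    return p, N // p
```

For instance, with $N = 15$ and $a = 7$ the order is 4, and the sketch returns the pair (3, 5); with $a = 14$ the order is 2 but $a^{r/2} \equiv -1$, so a different $a$ must be chosen.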

3.1. Phase Estimation Algorithm 

The phase estimation problem is defined as follows: Let $U$ be an $n$-qubit unitary transformation having
eigenvalues

$$\lambda_0 = e^{2\pi i\theta_0}, \ldots, \lambda_{2^n-1} = e^{2\pi i\theta_{2^n-1}}$$

with corresponding eigenkets

$$|\psi_0\rangle, \ldots, |\psi_{2^n-1}\rangle,$$

where $0 \le \theta_k < 1$. Given one of the eigenkets $|\psi_t\rangle$, estimate the value of $\theta_t$.

Cleve et al. solve this problem as follows: Construct two quantum registers, the first an $m$-qubit register, and
the second an $n$-qubit register. Then construct a unitary transformation $c_m(U)$ which acts on both registers as
follows:

$$c_m(U) : |k\rangle|\psi\rangle \to |k\rangle U^k|\psi\rangle \qquad (3)$$

where $|k\rangle$ and $|\psi\rangle$ denote the states of the first and second registers, respectively. The phase estimation algorithm
can be described as follows:

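Phase Estimation Algorithm:

Input: $U$ and $|\psi_t\rangle$, Output: An estimate of $\theta_t$

Note: $|r_1\rangle$ ($|r_2\rangle$) is the state of the first register (second register, respectively).

(1) Let $|r_1\rangle|r_2\rangle = |0\rangle|\psi_t\rangle$.

(2) $|r_1\rangle|r_2\rangle = (H^{\otimes m} \otimes I)|r_1\rangle|r_2\rangle$.

(3) $|r_1\rangle|r_2\rangle = c_m(U)|r_1\rangle|r_2\rangle$.

(4) $|r_1\rangle|r_2\rangle = (QFT^{-1} \otimes I)|r_1\rangle|r_2\rangle$.

(5) $j$ = the result of measuring $|r_1\rangle$.

(6) Output $j/2^m$.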

Step (1) is an initialization of the registers into the state $|0\rangle|\psi_t\rangle$ with input $|\psi_t\rangle$. Step (2) applies the
Hadamard transformation to the first register, leaving the registers in the state

$$|r_1\rangle|r_2\rangle = \frac{1}{\sqrt{2^m}} \sum_{k=0}^{2^m-1} |k\rangle|\psi_t\rangle. \qquad (4)$$

As a result of applying $c_m(U)$ in step (3), the registers are in the state

$$|r_1\rangle|r_2\rangle = \frac{1}{\sqrt{2^m}} \sum_{k=0}^{2^m-1} e^{2\pi i\theta_t k} |k\rangle|\psi_t\rangle. \qquad (5)$$

To understand the workings of step (4), let us assume that $\theta_t = j/2^m$ for some $j \in \{0, \ldots, 2^m - 1\}$. Then
equation (5) can be rewritten as:

$$|r_1\rangle|r_2\rangle = \frac{1}{\sqrt{2^m}} \sum_{k=0}^{2^m-1} e^{2\pi i jk/2^m} |k\rangle|\psi_t\rangle. \qquad (6)$$

By applying the inverse quantum Fourier transform in step (4), the registers are in the state

$$|r_1\rangle|r_2\rangle = |j\rangle|\psi_t\rangle. \qquad (7)$$

By making a measurement on the first register in step (5), we obtain $j$, where $\theta_t = j/2^m$.

In general, $\theta_t$ may not be of the form $j/2^m$. However, applying the inverse QFT in step (4)
results in $j/2^m$ being the best $m$-bit estimate of $\theta_t$ with a probability of at least $4/\pi^2$. For more details, please
consult Cleve et al. 5 A quantum circuit of the phase estimation algorithm is shown in figure 5.
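
The behaviour of steps (3) to (5) can be illustrated numerically. The sketch below tracks only the first register, which is sufficient because the second register stays in the eigenket throughout; it forms the phase-kicked amplitudes, applies the inverse Fourier transform as a dense matrix, and returns the measurement distribution. The function name and the dense-matrix approach are assumptions for illustration, suitable for small $m$ only.

```python
import numpy as np

def phase_estimation_distribution(theta, m):
    """Probability of each outcome j in {0, ..., 2^m - 1} for an eigenphase theta."""
    M = 2**m
    k = np.arange(M)
    # First register after the controlled powers of U: amplitudes exp(2*pi*i*theta*k)/sqrt(M).
    state = np.exp(2j * np.pi * theta * k) / np.sqrt(M)
    # Inverse quantum Fourier transform written as a dense matrix.
    inv_qft = np.exp(-2j * np.pi * np.outer(k, k) / M) / np.sqrt(M)
    return np.abs(inv_qft @ state) ** 2
```

When `theta` is exactly `j / 2**m`, the whole probability mass lands on `j`. For other values, such as `theta = 0.3` with `m = 4`, the most likely outcome is the nearest $m$-bit value, with probability above the $4/\pi^2$ bound quoted in the text.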

3.2. Order Finding Algorithm 

The order finding problem is defined as follows: Given a positive integer $N$ and an integer $a$ relatively prime
to $N$, find the smallest positive integer $r$ such that

$$a^r \equiv 1 \pmod N. \qquad (8)$$

First of all, we want a unitary transformation to use in the phase estimation algorithm. We call that unitary
transformation $M_a$, which is defined as follows:

$$M_a : |x\rangle \to |ax \pmod N\rangle, \qquad (9)$$




Figure 5. This figure shows the construction of a phase estimation circuit. The $m$-control $U$, $c_m(U)$, is not shown in
detail. However, if we have access to $U^{2^i}$, where $i \in \{0, 1, \ldots\}$, then $c_m(U)$ can be implemented using the method of
repeated squaring. As a result, $j/2^m$ is the best $m$-bit estimate of $\theta_t$.



where $|x\rangle$ is an $n$-qubit register (the second register). Let $\omega = e^{2\pi i/r}$, and for each $t \in \{0, \ldots, r - 1\}$, define

$$|\psi_t\rangle = \frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} \omega^{-st} |a^s\rangle. \qquad (10)$$

Then, for each $t$, $M_a|\psi_t\rangle = \omega^t|\psi_t\rangle$. In other words, $\omega^t$ is an eigenvalue of $M_a$ with respect to the eigenvector $|\psi_t\rangle$.
Furthermore, $\theta_t = t/r$ for each $t$. Therefore, if we are given an eigenvector $|\psi_t\rangle$, and we know how to construct
$c_m(M_a)$, then we can find $r$ (which is the period of $a$) by using the phase estimation algorithm.

Unfortunately, it is not trivial to construct $|\psi_t\rangle$ for every $t$. Instead of using $|\psi_t\rangle$, we use $|1\rangle$, which is effectively
equivalent to selecting $|\psi_t\rangle$ where $t$ is randomly selected from $\{0, \ldots, r - 1\}$. Then, we use the phase estimation
algorithm to compute the value of $j/2^m$, which is the best $m$-bit estimate of $t/r$. We extract the value of
$t/r$ by using the continued fraction algorithm. If $t$ and $r$ are relatively prime, then we get $r$, which is the period
of $a$. The output $r$ of the phase estimation algorithm can be tested by checking that $a^r \equiv 1 \pmod N$. If $r$ is not
the period of $a$, then we can re-execute this algorithm until $t$ is coprime to $r$, which occurs with high probability
in $O(\log\log N)$ rounds. 5
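
The classical post-processing step can be sketched with Python's standard `fractions` module, whose `limit_denominator` method returns the closest fraction with a bounded denominator (computed via continued fractions). This is an illustrative sketch, not the paper's procedure: the function name and the denominator bound $N - 1$ are assumptions, and the candidate is accepted only if it passes the test $a^r \equiv 1 \pmod N$ described above.

```python
from fractions import Fraction

def order_from_measurement(j, m, a, N):
    """Recover a candidate order r from the measured value j of an m-qubit register."""
    # Closest fraction t/r to j/2^m whose denominator is below N.
    candidate = Fraction(j, 2**m).limit_denominator(N - 1)
    r = candidate.denominator
    # Accept only if r really satisfies a^r = 1 (mod N); otherwise the run is repeated.
    return r if pow(a, r, N) == 1 else None
```

As a small example with $N = 21$, $a = 2$ (order 6) and $m = 10$: the measurement $j = 171$, the nearest value to $2^{10}/6$, yields 6, while $j = 341$, the nearest value to $2 \cdot 2^{10}/6$, reduces to $1/3$ and is rejected because $t = 2$ is not coprime to $r = 6$.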

In the next section, we describe an implementation of $c_m(U)$. This calculation is equivalent to the calculation
that Shor uses in his factoring algorithm, known as "modular exponentiation." Another detailed implementation
of modular exponentiation can be found in Beckman et al. 7

3.3. An implementation of modular exponentiation 

To complete the implementation of the order finding algorithm, we need to construct the unitary transformation
$c_m(M_a)$. We accomplish this by using the method of repeated squaring.

Let $k_{m-1}k_{m-2} \ldots k_1k_0$ be the binary expansion of the contents of the first register, $|k\rangle$. It now follows that

$$M_a^k = \prod_{i=0}^{m-1} M_a^{k_i 2^i} = \prod_{i=0}^{m-1} \left(M_a^{2^i}\right)^{k_i}. \qquad (11)$$

Then, for each $i$, we can implement the term $(M_a^{2^i})^{k_i}$ as a controlled $M_a^{2^i}$, where the control qubit is $|k_i\rangle$.

Please note that $a$ is a constant integer, and that $M_a^{2^i} = M_{a^{2^i}}$ for all $0 \le i < m$. Therefore, we can
precompute the value of $a^{2^i}$ by classical computers. Then we can apply the same technique used to implement
$M_a$ to implement $M_{a^{2^i}}$. Figure 6 shows an implementation of $c_m(M_a)$.
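
A classical analogue of this decomposition may make the structure clearer. In the sketch below, the constants $a^{2^i} \bmod N$ are precomputed by repeated squaring, and each bit $k_i$ of the exponent decides whether the corresponding multiplication is applied, mirroring the controlled $M_{a^{2^i}}$ gates. The function names are hypothetical.

```python
def precompute_powers(a, N, m):
    """Return [a^(2^0), a^(2^1), ..., a^(2^(m-1))] modulo N by repeated squaring."""
    powers, x = [], a % N
    for _ in range(m):
        powers.append(x)
        x = (x * x) % N
    return powers

def controlled_modexp(x, k, a, N, m):
    """Map x to x * a^k (mod N) using one 'controlled' multiplication per bit of k."""
    powers = precompute_powers(a, N, m)
    for i in range(m):
        if (k >> i) & 1:            # bit k_i plays the role of the control qubit
            x = (x * powers[i]) % N
    return x
```

Starting from `x = 1`, the result agrees with Python's built-in `pow(a, k, N)` for every $m$-bit exponent `k`.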

3.3.1. Reusing ancillary qubits 

For a given polynomial-time function $f$, we can construct a unitary transformation $F$ which maps $|x\rangle|0\rangle$ to
$|x\rangle|f(x)\rangle$. However, the complete definition of $F$ also includes ancillary qubits which contain information
necessary for $F$ to be reversed. Let $g$ be a function that computes the additional information, called "garbage". The
complete definition of $F$ is

$$F : |x\rangle|0\rangle|0\rangle \to |x\rangle|f(x)\rangle|g(x)\rangle. \qquad (12)$$

Figure 6. Let $k_{m-1} \ldots k_1k_0$ be a binary representation of $k$. Then $M_a^k = \prod_{i=0}^{m-1} M_a^{k_i 2^i}$. The term $M_a^{k_i 2^i}$ is implemented
as a controlled-$M_a^{2^i}$ circuit, where the control qubit is $|k_i\rangle$.



The garbage needs to be reset, or erased, to state $|0\rangle$ before we make a measurement. Otherwise, the result
of the measurement could be affected by the garbage. To erase the garbage, Shor uses Bennett's technique, which
we review in this section.

First we compute $F(x)$. Once we have the output $|f(x)\rangle$, we copy $|f(x)\rangle$ into an extra register which has been
preset to state $|0\rangle$. Then we erase the output and the garbage of $F$ by reverse computing $F(x)$. In particular,
this procedure is described as follows:

$$|x\rangle|0\rangle|0\rangle|0\rangle \xrightarrow{F} |x\rangle|f(x)\rangle|g(x)\rangle|0\rangle \xrightarrow{COPY} |x\rangle|f(x)\rangle|g(x)\rangle|f(x)\rangle \xrightarrow{F^r} |x\rangle|0\rangle|0\rangle|f(x)\rangle,$$

where $F^r$ is the reverse computation of $F$. We copy $|f(x)\rangle$ to the extra register bit by bit by applying a CNOT
gate on each qubit. We define $XF = F^r \cdot COPY \cdot F$.

If $f$ is a polynomial-time invertible function, we can create a unitary transformation $OF$ which overwrites an
input $|x\rangle$ with the output $|f(x)\rangle$. We start from the construction of a unitary transformation $FI$ as follows:

$$FI : |x\rangle|0\rangle \to |x\rangle|f^{-1}(x)\rangle, \qquad (13)$$

where $f^{-1}$ is a polynomial-time inverse function of $f$. The transformation $FI$ may generate garbage, but it can
be erased by using the technique mentioned above. Finally, we implement $OF$ as follows:

$$|x\rangle|0\rangle \xrightarrow{F} |x\rangle|f(x)\rangle \xrightarrow{SWAP} |f(x)\rangle|x\rangle \xrightarrow{FI^r} |f(x)\rangle|0\rangle,$$

where $FI^r$ is the reverse computation of $FI$. The SWAP is a swap gate that swaps the content of the input and
the output registers.
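
The compute, copy, uncompute pattern and the overwriting construction can be mimicked with classical reversible operations on integer registers, where writing into a register is modelled as a bitwise XOR. This is only a sketch of the bookkeeping: the functions `f`, `f_inv` and `g` below are hypothetical stand-ins chosen so that `f` is invertible on 3-bit values, and the XOR model makes `F` its own reverse.

```python
def f(x):
    return (3 * x) % 8          # hypothetical invertible function on 3-bit values

def f_inv(y):
    return (3 * y) % 8          # inverse of f, since 3 * 3 = 9 = 1 (mod 8)

def g(x):
    return x >> 1               # hypothetical "garbage" left behind by F

def F(x, out, garbage):
    """Reversible F: XOR f(x) into the output and g(x) into the garbage register."""
    return x, out ^ f(x), garbage ^ g(x)

def XF(x):
    """F, then COPY the output into an extra register, then reverse F."""
    x, out, garbage = F(x, 0, 0)
    extra = 0 ^ out                      # bitwise copy, one CNOT per bit
    x, out, garbage = F(x, out, garbage) # F is self-inverse in this XOR model
    return x, out, garbage, extra

def OF(x):
    """Overwrite x with f(x): compute, swap, then uncompute with the reverse of FI."""
    _, _, _, y = XF(x)                   # registers now hold (x, f(x)), garbage erased
    first, second = y, x                 # SWAP
    second ^= f_inv(first)               # reverse of FI clears the second register
    return first, second
```

After `XF`, the output and garbage registers are back to zero and only the copy of `f(x)` remains; after `OF`, the input register holds `f(x)` and the work register is clean.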

3.3.2. Binary adders 

We continue our construction of $M_a$ by first implementing "binary adders." There are two types of binary adders,
the "binary full adder" and the "binary half adder," denoted by $BFA_a$ and $BHA_a$, respectively. The $BFA_a$ and $BHA_a$
are defined as follows:

$$BFA_a : |c\rangle|b\rangle|0\rangle \to |a \oplus b \oplus c\rangle|b\rangle|c'\rangle \qquad BHA_a : |c\rangle|b\rangle \to |a \oplus b \oplus c\rangle|b\rangle, \qquad (14)$$

where $a$ is a classical bit, and $|c\rangle$ and $|c'\rangle$ are input and output carries, respectively. The circuits for $BFA_a$ and
$BHA_a$ are shown in figure 7.

The dotted line represents a classical bit $a$ which is used to control the quantum gates. If the classical bit
is 0, the quantum gate is not applied. If the classical bit is 1, then the quantum gate is applied. The binary
full adder adds the classical bit $a$ to the carry $|c\rangle$ first, then adds the qubit $|b\rangle$ to the sum. Because the carry is not
computed by $BHA_a$, we remove two gates (the first gate and the Toffoli gate) from $BFA_a$ in order to implement
$BHA_a$.

Figure 7. The dotted line represents the classical bit $a$ which is used to control the quantum gates. The $BFA_a$ computes
an additional output carry qubit, while the $BHA_a$ does not.

3.3.3. An n-qubit adder 

For each classical $n$-bit integer $a$, an $n$-qubit full adder $FA_a$ is the unitary transformation defined by

$$FA_a : |b\rangle|0\rangle|c\rangle \to |b\rangle|s\rangle|c'\rangle \qquad (15)$$

where $|s\rangle$ is an $n$-qubit register, $s = a + b + c \pmod{2^n}$, and $c$ and $c'$ are the input and output carries, respectively.
A quantum circuit for $FA_a$ is shown in figure 8, where $a_{n-1} \cdots a_1a_0$, $b_{n-1} \cdots b_1b_0$, and $c_{n-1} \cdots c_1c_0$ are $n$-bit
binary representations of $a$, $b$ and $c$, respectively.

Figure 8. By applying the BFAs bit by bit, as shown in the above circuit, we effectively add an $n$-bit number $a$ to the
$n$-qubit number $|b\rangle$ and the carry $|c\rangle$. The outputs are registers $|b\rangle$, $|s\rangle$, and a new carry $|c'\rangle$, where $s = a + b + c \pmod{2^n}$.
The thick lines represent an $n$-qubit register.
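
The bit-by-bit chaining of binary full adders can be illustrated with a classical ripple-carry sketch. It follows the arithmetic of $BFA_a$ and $FA_a$ only (sum bit $a \oplus b \oplus c$ and an outgoing carry), not the reversible gate layout of the figures; the function names are hypothetical.

```python
def bfa(a_bit, b_bit, c_in):
    """One-bit full adder: returns (sum bit, output carry)."""
    s = a_bit ^ b_bit ^ c_in
    c_out = (a_bit & b_bit) | (c_in & (a_bit ^ b_bit))
    return s, c_out

def full_adder(a, b, c_in, n):
    """Add the classical n-bit constant a to b with input carry c_in.

    Returns (s, c_out) with s = a + b + c_in (mod 2^n) and c_out the final carry.
    """
    s, carry = 0, c_in
    for i in range(n):                      # least significant bit first
        bit, carry = bfa((a >> i) & 1, (b >> i) & 1, carry)
        s |= bit << i
    return s, carry
```

The final carry is what the modular adder later inspects to decide whether a correction is needed.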



We replace the last $BFA_{a_{n-1}}$ with a $BHA_{a_{n-1}}$ to construct $HA_a$. As a result, we need only $n - 1$ input
ancillary qubits with initial state $|0\rangle$ to implement $HA_a$. By including an input carry qubit $|c\rangle$, the $HA_a$ is a
$2n$-qubit unitary transformation.

An n-qubit adder modulo N 

We use $FA_a$ and $HA_a$ to implement the $n$-qubit adder modulo $N$ ($AN_a$). We observe that if $a + b < N$, then
$a + b \pmod N = a + b \pmod{2^n}$; otherwise $a + b \pmod N = a + b + 2^n - N \pmod{2^n}$. We implement $AN_a$ as
follows: First we compute the sum of $|b\rangle$ with the classical number $a + 2^n - N$ modulo $2^n$. If the carry is not
set, then we subtract $2^n - N$ from the sum. Hence, we have a transformation $AN_a$, given by

$$AN_a : |b\rangle|0\rangle|0\rangle|0\rangle \to |b\rangle|s\rangle|c\rangle|a + b \pmod N\rangle, \qquad (16)$$

where $s = b + a + 2^n - N \pmod{2^n}$, and $c$ is the carry. The circuit that implements $AN_a$ is shown in figure 9.
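
The carry test at the heart of $AN_a$ can be checked classically. The sketch below assumes $0 \le a, b < N < 2^n$ and uses ordinary integer arithmetic in place of the reversible adders; the function name is hypothetical.

```python
def add_mod_N(a, b, N, n):
    """Compute a + b (mod N) using only n-bit additions and a carry test."""
    mask = (1 << n) - 1
    offset = (1 << n) - N                  # the classical constant 2^n - N
    total = b + (a + offset)               # first adder: b + (a + 2^n - N)
    s, carry = total & mask, total >> n
    if not carry:                          # no overflow means a + b < N
        s = (s - offset) & mask            # undo the offset
    return s
```

When the carry is set, the overflow has already removed $N$ from the sum, so no further correction is required.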





Figure 9. First, we add the number $a + 2^n - N$ to the ket $|b\rangle$. If the carry bit is not set, then we subtract $2^n - N$ from the
sum. As a result, we compute $a + b \pmod N$.



We use the technique described in section 3.3.1 to reset $|s\rangle$ and $|c\rangle$ back to state $|0\rangle$. As a result, we obtain
a transformation $XAN_a = AN_a^r \cdot COPY \cdot AN_a$ which acts as follows:

$$XAN_a : |b\rangle|0\rangle|0\rangle|0\rangle|0\rangle \to |b\rangle|0\rangle|0\rangle|0\rangle|a + b \pmod N\rangle. \qquad (17)$$

In other words, $XAN_a$ is a $2n$-qubit transformation (with $2n + 1$ ancillary qubits) which sends $|b\rangle|0\rangle$ to $|b\rangle|a + b \pmod N\rangle$.
The wiring diagram for $XAN_a$ is shown in figure 10.

[Figure 10: wiring diagram for $XAN_a$, with input registers $|b\rangle$, $|0\rangle$, $|0\rangle$, $|0\rangle$, $|0\rangle$ and final output register $|a + b \pmod N\rangle$.]

Figure 10. Using CNOT gates to copy bit by bit from the output register of $AN_a$ to an $n$-qubit ancillary register, we
can apply $AN_a^r$ so that $|s\rangle$ and $|c\rangle$ are reset to state $|0\rangle$.

Because the inverse of the map computed by $XAN_a$ is computed by $XAN_{-a}$, the input of $XAN_a$ can be overwritten by the output
of $XAN_a$ by using the technique described in section 3.3.1. We now define the adder $A_a$ as follows:

$$A_a = XAN_{-a}^{-1} \cdot SWAP \cdot XAN_a. \qquad (18)$$

As a result, the transformation $A_a$ is an $n$-qubit transformation (with $3n + 1$ ancillary qubits) which maps $|b\rangle$
to $|a + b \pmod N\rangle$.

3.3.4. An n-qubit multiplier 

Now we are ready to describe the construction of $M_a$, which maps $|x\rangle$ to $|ax\rangle$, where $x \in \mathbb{Z}_N$. We define a
$2n$-qubit unitary transformation $MF_a$ as follows:

$$MF_a : |x\rangle|0\rangle \to |x\rangle|ax \pmod N\rangle. \qquad (19)$$

Assuming $x_{n-1} \ldots x_1x_0$ is the binary representation of $x$, we have

$$ax = \sum_{i=0}^{n-1} a x_i 2^i. \qquad (20)$$

For each $0 \le i < n$, the term $a x_i 2^i$ can be implemented by the controlled-$A_{a2^i}$, where $|x_i\rangle$ is the control qubit, and

$$A_{a2^i} : |b\rangle \to |b + a2^i \pmod N\rangle. \qquad (21)$$

Since $a2^i$ is constant for each $i$, we can compute each $a2^i$ by using a classical computer. Then we use the result
and the same technique for implementing $A_a$, as described in section 3.3.3, to construct $A_{a2^i}$. Therefore, the
transformation $MF_a$ can be implemented using the method of repeated squaring with a circuit similar to the
circuit shown in figure 6. Hence, $MF_a$ is a $2n$-qubit transformation sending $|x\rangle|0\rangle$ to $|x\rangle|ax\rangle$, using $3n + 1$
ancillary qubits.
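
Equations (20) and (21) amount to a shift-and-add multiplier in which every addend $a2^i \bmod N$ is a classical constant and each bit $x_i$ decides whether it is added. A minimal classical sketch, with a hypothetical function name, is:

```python
def multiply_mod_N(a, x, N, n):
    """Compute a * x (mod N) as a sum of bit-controlled modular additions."""
    acc = 0
    for i in range(n):
        addend = (a << i) % N          # a * 2^i mod N, precomputed classically
        if (x >> i) & 1:               # bit x_i plays the role of the control qubit
            acc = (acc + addend) % N   # one application of the modular adder
    return acc
```

Each pass through the loop corresponds to one controlled modular addition, so an $n$-bit input uses $n$ of them.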

Finally, with the overwriting output technique described in section 3.3.1, the transformation $M_a$ can be
implemented as $M_a = MF_{a^{-1}}^{-1} \cdot SWAP \cdot MF_a$. (Note that, because $a$ and $N$ are relatively prime, $a^{-1}$ always
exists in $\mathbb{Z}_N$.) In other words, $M_a$ is an $n$-qubit transformation with $4n + 1$ ancillary qubits. Thus, the
$M_a$ so constructed can be plugged into the transformation $c_m(M_a)$, as described earlier in section 3.3.

3.4. Complexity analysis 

We analyze the complexity of our implementation of Shor's algorithm for two parameters, i.e., the number of 
gates and the number of qubits. 

Gate complexity 

To count the number of gates, we define a function G(F) to be the number of gates used to implement the 
transformation F. We recursively compute the number of gates as follows: 



Since $G(H^{\otimes m}) = m$, $G(QFT^{-1}) = m(m + 1)/2 = O(m^2)$, and $G(c_m(M_a)) = 70mn^2 - 6mn = O(mn^2)$, it follows
that the gate complexity of this implementation is $O(mn^2)$. In general, $m = 2n$. Therefore, the complexity is
$O(n^3)$.

However, we count a control gate with multiple control qubits as one gate. In fact, a control gate with
multiple control qubits can be broken down into a sequence of Toffoli gates using the techniques described by
Barenco et al. 8 Moreover, the number of needed Toffoli gates grows exponentially with respect to the number of
control qubits in the control gate. Fortunately, the number of control qubits in Shor's algorithm is at most 5:
one control qubit for $c_m(M_a)$, one control qubit for $M_a$, one control qubit for the controlled $FA_a$ in the implementation
of $AN_a$, and two control qubits in the implementation of $FA_a$. Moreover, the number of control qubits does not
depend on the input number $N$. Therefore, there is only a constant overhead from breaking down a control gate with
multiple control qubits into a sequence of Toffoli gates. This overhead does not affect the gate complexity.

Space complexity 

First of all, $XAN_a$ is a $2n$-qubit transformation with $2n + 1$ ancillary qubits, for a total of $4n + 1$ qubits. In addition, we need $n$ qubits to control the
transformation $A_a$ in the implementation of $M_a$, and $m$ more qubits to control the transformation $M_a$ in the
implementation of $c_m(M_a)$. Therefore, the number of qubits needed in this implementation is $5n + m + 1$.



4. DISTRIBUTED QUANTUM FACTORING ALGORITHM

We implement a distributed quantum factoring algorithm as follows: First, we partition the
$5n + m + 1$ qubits into groups in such a way that each group fits on one of the quantum computers making up a
network. Then, we implement a distributed quantum factoring algorithm on this quantum network by replacing
a control gate with a non-local control gate, whenever necessary.

In this paper, we describe a distributed quantum factoring algorithm to factor a number $N$ within specific
parameters. We assume that we have a network of $(n + c)$-qubit quantum computers, where $n = \log N$. The $c$
extra qubits for each computer can be used as either channel qubits or ancillary qubits. We will show that $c$ is
a constant which does not depend on the input number $N$. To be more specific, we choose $m = 2n$. Therefore,
the number of qubits needed in this implementation is $7n + 1$. Although this particular implementation
is specific to certain parameters, it can easily be generalized.

First we divide the control register of $c_m(M_a)$, $|k\rangle$, into two $n$-qubit groups. Then we place these two groups
on two different computers. Each qubit $|k_i\rangle$ of these two groups remotely controls the transformation $M_a^{2^i}$.

Another computer is assigned to hold the control register of $MF_a$, i.e., $|x\rangle$. Each qubit $|x_j\rangle$ remotely controls
the transformation $A_{a2^j}$.

Next, we implement the transformation $XAN_a$, which is a component of $A_a$. The transformation $XAN_a$
has two registers, one $n$-qubit input register $|b\rangle$, and one $n$-qubit output register $|a + b\rangle$. However, $XAN_a$ also
requires $2n + 1$ ancillary qubits, i.e., one carry bit, $n$ qubits for the intermediate sum $|s\rangle$, and $n$ qubits for the
intermediate output register $|a + b\rangle$. Therefore, it takes four computers to compute $XAN_a$. Each computer
computes 1/4 of each register, as shown in figure 11. Each computer holds $n/4$ qubits from the input register
$|b\rangle$, $n/4$ qubits from the intermediate sum register $|s\rangle$, $n/4$ qubits from the intermediate output register, and $n/4$
qubits from the output register $|a + b\rangle$ (represented by thick lines). Each computer also has two extra carry
qubits, which are used in computing $FA$ and $FA'$.

Figure 11. This figure shows how to compute $AN_a$ followed by COPY transformations. Each computer holds $n/4$ qubits
from each register. Each computer also has two carry qubits which have been set to $|0\rangle$. The arrow line represents
teleportation of the output carry qubit to the next computer. Each transformation is remotely controlled by two qubits,
one from the register $|k\rangle$, and the other from the register $|x\rangle$.



The first four FA transformations compute $FA_{a+2^n-N}$ with the input carry $|c\rangle = |0\rangle$. These FAs are remotely
controlled by two control qubits, one from the register $|k\rangle$, and the other from the register $|x\rangle$. A distributed
control FA with two control qubits is implemented by distributing the two control qubits onto the computer that
holds the target qubits, and then implementing the double control locally, as shown in figure 12. After completing
each FA computation, the output carry bit is teleported to the next FA on another computer. The teleportations
are represented by arrow lines.

The transformation $HA_{-(2^n-N)}$ is computed by the next three full adders $FA'$, and a half adder $HA$. The
integer $-(2^n - N)$ is precomputed by a classical computer, and then used to implement $FA'$ and $HA$. Similarly,
the carry qubit is teleported from one computer to another. The last carry qubit is teleported into the first qubit
of the intermediate output register $|(a + b)_{3n/4 \ldots n-1}\rangle$.

Figure 12. A distributed multi-control gate can be implemented by distributing all control qubits to the computer
that holds the target qubits, then implementing the multi-control gate locally. This figure shows how to implement a
distributed control-control-FA gate.



Each $FA'$ and the single $HA$ are controlled by three qubits: one from the first register $|k\rangle$, one from the
register $|x\rangle$, and the last from the output carry bit of $FA_{a+2^n-N}$. A non-local three-control gate can be implemented
by distributing all three control qubits onto the target computer, and locally implementing the control gate with
three control qubits.

The COPY transformation is a bitwise copy implemented in terms of CNOT gates. Because each computer
possesses $n/4$ qubits of the intermediate output register and of the final output register, the distributed COPY can be
easily implemented by locally applying CNOT gates, as shown in figure 11. However, COPY still needs to be
remotely controlled by two qubits from register $|k\rangle$ and register $|x\rangle$.

Similarly, each machine possesses $n/4$ qubits of both the input register and the output register. The distributed
SWAP can be locally implemented on each machine, remotely controlled by two qubits from register $|k\rangle$ and
register $|x\rangle$.

The number of extra qubits 

The number of extra qubits c depends on two factors: The number of channel qubits, and the number of extra 
carry qubits needed in the implementation. The number of channel qubits depends on how many non-local 
control qubits are needed. In this implementation, at most three non-local control qubits are implemented.
Therefore, at most 3 channel qubits are required at one time. Furthermore, there are only two extra carry 
qubits (one carry qubit for transformation FA and another carry qubit for transformation FA') needed in this 
implementation. Therefore, c = 5, and does not depend on the input N. 

4.1. Communication complexity 

By communication complexity, we mean the number of entangled pairs that need to be established, and the
number of classical bits that need to be transmitted in each direction. The optimum cost of implementing a
non-local operation is one EPR pair and two classical bits (one in each direction). Therefore, if we can count the
number of non-local control gates and teleportation circuits, we can estimate the communication overhead. The
communication overhead of a control gate with multiple control qubits (such as a control-FA with two control
qubits) is equal to the overhead for a single non-local CNOT gate multiplied by the number of control qubits.
Fortunately, the number of non-local control qubits is at most 3, a constant. Therefore, for asymptotic purposes we can count every such gate
as one control gate.

If we simply count every gate as a non-local gate, the communication overhead is $O(mn^2)$. This number is
an overestimate because the cost of each non-local control-$U$ gate, where $U$ can be decomposed into a number
of elementary gates, can be shared among these elementary gates.



To be more precise, we define a function NL(F) to be the number of non-local control gates implemented in 
the distributed implementation of circuit F. We compute NL(SHOR) as follows: 



As shown in figure 11, there are 8 non-local control gates per $AN_a$, i.e., $NL(AN_a) = 8$. (The non-local
control NOT gate in the middle can be included in the implementation of the last non-local control FA.) Four
non-local control circuits are sufficient to implement COPY. Similarly, another four non-local control circuits
are sufficient to implement SWAP. Therefore, $NL(c_m(M_a)) = O(mn)$. Since $NL(QFT^{-1}) = O(m^2)$,
it follows that $NL(SHOR) = O(mn + m^2)$.

Similarly, we define a function $T(F)$ to be the number of teleportation circuits implemented in the distributed
implementation of circuit $F$. Then, six teleportation circuits are sufficient to implement $AN_a$. There is no need
for a teleportation circuit in COPY, SWAP, and $QFT^{-1}$. Therefore, $T(SHOR) = 12mn = O(mn)$.

As a result, the communication complexity of Shor's algorithm is $NL(SHOR) + T(SHOR) = O(mn + m^2)$. In this
particular implementation, $m = 2n$. Hence, the communication overhead is $O(n^2)$.